1. Introduction to Project
Our team’s project focuses on “Analysis and Prediction of Resale Prices of HDB in Singapore.” The objective is to utilize historical data (2013-2023) on the resale of HDB in Singapore to analyze the factors influencing resale prices and to forecast these prices using machine learning models.
1.1 Data Preparation
The data is sourced from the Kaggle dataset titled “Resale Prices of HDB flats from 2012 to 2023.” Through collaborative efforts, our team has completed the data preparation phase. For further details, please visit: link to data preparation. Going forward, I will provide a brief overview of the cleaned dataset.
The following graph shows how we get a new data file from the kaggle dataset.

After the above step, we performed data inspection and cleaning, changed the data types of categorical variables, and removed duplicate rows. At the end of the data cleaning process, we saved the data file as resale_hdb.rds. Now, let’s take a brief look at the cleaned dataset.
Rows: 237,909
Columns: 11
$ month <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2…
$ town <fct> ANG MO KIO, ANG MO KIO, ANG MO KIO, ANG MO KIO, AN…
$ flat_type <fct> 2 ROOM, 2 ROOM, 2 ROOM, 3 ROOM, 3 ROOM, 3 ROOM, 3 …
$ block <fct> 510, 314, 323, 170, 174, 445, 607, 535, 449, 602, …
$ street_name <fct> ANG MO KIO AVE 8, ANG MO KIO AVE 3, ANG MO KIO AVE…
$ storey_range <fct> 01 TO 03, 01 TO 03, 04 TO 06, 07 TO 09, 01 TO 03, …
$ floor_area_sqm <dbl> 44, 44, 44, 61, 60, 67, 68, 67, 68, 67, 74, 67, 68…
$ flat_model <fct> Improved, Improved, Improved, Improved, Improved, …
$ lease_commence_date <int> 1980, 1978, 1977, 1986, 1986, 1979, 1980, 1980, 19…
$ resale_price <dbl> 253000, 270000, 283000, 305000, 320000, 325000, 32…
$ remaining_lease <dbl> 66, 64, 63, 72, 72, 65, 66, 66, 65, 66, 66, 65, 66…
From the above output, we can see that the dataset we are going to analyze contains 237,909 HDB resale records, including 11 variables. Among them, there are:
4 numerical variables: floor_area_sqm, lease_commence_date, resale_price, remaining_lease;
7 categorical variables: month, town, flat_type, block, street_name, storey_range, flat_model.
Let’s see how many unique values in each variable.
#Count unique values of each variable
unique_counts <- sapply(data, function(x) length(unique(x)))
data.frame(Unique_Values_Count = unique_counts) Unique_Values_Count
month 132
town 26
flat_type 7
block 2699
street_name 567
storey_range 17
floor_area_sqm 176
flat_model 21
lease_commence_date 56
resale_price 4219
remaining_lease 57
Now, I will introduce the definitions of each variable to facilitate your understanding.

1.2 Objective of The Report
- This report will provide a detailed exposition of the following sections of the group project:
- Exploratory Data Analysis
Univariate analysis: aiming to assist users to understand the quantity and distribution of resale HDB properties through different variable.
Bivariate analysis: aiming to assist users to understand the driving factors of the resale price through visualizations.
- Clustering
- Prediction of Resale Price
- Exploratory Data Analysis
- This report will offer UI design for each section in the Shiny App.
2. Exploratory Data Analysis
As above mentioned, this report will present Exploratory Data Analysis (EDA) through two approaches. See the graph below:

Univariate and bivariate are commonly used methods in EDA:
- Univariate Analysis: Univariate Analysis refers to analyzing individual variables. It focuses on the distribution, central tendency, and dispersion of single variables. Common methods for univariate analysis include descriptive statistics, histograms, and box plots. Through univariate analysis, we can understand the properties and characteristics of individual variables without considering the influence of other variables.
The purpose of doing univariate analysis in this project is:
to provide the frequencies and distributions of different variables (such as flat type, housing area, geographical location, etc.), enabling users to understand the resale market situation or popularity of their own HDB flat or second-hand HDB flats they are interested in purchasing on the resale market.
- Bivariate Analysis: Bivariate Analysis involves analyzing the relationship between two variables. It explores the correlation, joint distribution, and trends between two variables. Common methods for bivariate analysis include scatter plots, correlation coefficient analysis, and bivariate regression. Bivariate analysis helps us explore the relationship between two variables and understand their interactions and effects.
The purpose of doing bivariate analysis in this project is:
to explore the relationship between resale price and other variables, helping users understand the driving factors of resale price.
These methods are very useful in exploring our dataset, helping us understand the relationships, characteristics, and patterns of variables, and providing a foundation for further data analysis and modeling.
Based on the above approach, we have designed the user interface for the EDA section in the ShinyApp, as depicted in the following sketch:

Moving forward, let’s delve into more detailed design for each approach.
2.1 Univariate Analysis
2.1.1 Preparation
First of all, let’s identify variables suitable for univariate analysis and find appropriate charts for them. The following chart illustrates how I select suitable variables for univariate analysis and the optimal chart I will use to present to users.

2.1.2 Chart Creation
We can utilize different R packages to create different types of plots.
Now, I will showcase examples of the source code for creating different types of charts.
#Univariate Analysis: by month
# group data by month and count quantity for each month
data_summarized <- data %>%
group_by(month) %>%
summarise(quantity = n())
# create plotly figure
fig <- plot_ly(data_summarized, x = ~month, y = ~quantity, type = "scatter", mode = "lines+markers",
line = list(color = "blue", width = 0.3),
marker = list(color = "red", size = 3))
# add range slider and customize layout
fig <- fig %>%
layout(title = "Resale Quantity by Month (2013-2023)",
xaxis = list(title = "Month in Year",
rangeslider = list(visible = TRUE),
tickformat = "%Y-%m",
tickangle = 60,
tickfont = list(size = 6)),
yaxis = list(title = "Quantity",
tickmode = "linear",
tick0 = 0,
dtick = 500))
# display the plot
fig# Univariate Analysis: by town
# Sort the quantity of HDB resale transaction in different towns from high to low using fct_reorder.
data$town <- fct_reorder(data$town, data$town, function(x) -length(x))
# create a histogram by town
p3 <- ggplot(data, aes(x = town)) +
geom_bar(fill = "steelblue", color = "black", alpha = 0.5) + # Changed geom_histogram to geom_bar
labs(title = "Resale Quantity by town (2013-2023)", x = "town", y = "Quantity") +
theme(axis.text.x = element_text(size = 6, angle = 60, vjust = 0.5))
ggplotly(p3)#Univariate Analysis: by remaining_lease
p3 <- ggplot(data, aes(x = "", y = remaining_lease)) +
geom_boxplot(color = "blue") +
geom_violin(fill = "forestgreen", color = "forestgreen", alpha = 0.3) +
stat_summary(fun = "mean", geom = "point", shape = 18, size = 4, color = "red", position = position_dodge(width = 0.75)) +
labs(title = "Violin Plot of Remaining Lease (2013-2023) (Red dot: Mean Value)", y = "Remaining Lease", x="")
ggplotly(p3)2.1.3 UI Design
In the Shiny app, we will design a feature where users can freely select a variable, view its visualization, and simultaneously see either the count summary or descriptive statistics of that variable.
The demo is as follows:

2.2 Bivariate Analysis
2.2.1 Preparation
First of all, let’s identify variables suitable for univariate analysis and find appropriate charts for them. The following chart illustrates how I select suitable variables for univariate analysis and the optimal chart I will use to present to users.

From the image above, it’s apparent that “month” stands out as a unique variable becaue of its chart type . I will create a separate box for it to display the results in Shiny. Other variables can be represented using box plots, treemaps, and heatmaps. Below, I will illustrate the process of creating visualizations for “month”, categorical variables, and numerical variables.
2.2.2 Chart Creation
We can utilize different R packages to create different types of plots.
Now, I will showcase examples of the source code for creating different types of charts.
Month
#Bivariate Analysis: resale price by month
# Group the data by month and calculate the median resale price for each month.
median_data <- aggregate(resale_price ~ month, data = data, FUN = median)
# Create a plotly figure
fig <- plot_ly(median_data, x = ~month, y = ~resale_price, type = "scatter", mode = "lines+markers",
line = list(color = "blue", width = 0.3),
marker = list(color = "red", size = 3))
# Add range slider
fig <- fig %>%
layout(title = "Median Resale Price by Month (2013-2023)",
xaxis = list(title = "Month", rangeslider = list(visible = TRUE)),
yaxis = list(title = "Median Resale Price"))
# Display the plot
fig# Convert the month column to a date format and extract the month
data$month <- as.Date(paste0(data$month, "-01"))
data$month_group <- format(data$month, "%Y-%m")
# Group the data by month and calculate the median resale price for each month
median_prices <- data %>%
group_by(month_group) %>%
summarise(median_price = median(resale_price))
# Create the boxplot
p <- ggplot(data, aes(x = month_group, y = resale_price)) +
geom_boxplot(fill = "steelblue", color = "grey30", alpha = 0.5) +
geom_line(data = median_prices, aes(x = month_group, y = median_price, group = 1), color = "red", size = 0.5) +
labs(
title = "Resale Price by Month (2013-2023)",
x = "Month",
y = "Resale Price"
) +
theme_dark() +
theme(axis.text.x = element_text(size = 6, angle = 90, vjust = 0.5), legend.position = "none") +
scale_x_discrete(labels = function(x) substr(x, 1, 7)) # Display only the year and month
ggplotly(p)Categorical Variable:
Categorical variables will be plotted are town, flat_type, storey_range, flat_model. I will take town as an example here.
# Prepare the data for treemap
treemap_data <- data %>%
group_by(town) %>%
summarise(count = n(), median_resale_price = median(resale_price))
# Create the basic treemap
treemap(
treemap_data,
index = "town",
vSize = "count",
vColor = "median_resale_price",
type = "value",
bg.labels = "white",
align.labels = list(c("center", "center"), c("right", "bottom")),
title = "Treemap of Median Resale Price by town (2013-2023)",
title.legend = "Median Resale Price",
fontsize.labels = 12
)
# Bivariate Analysis: resale price by town
# Calculate median resale price for each town
median_prices <- data %>%
group_by(town) %>%
summarise(median_price = median(resale_price))
# Reorder the levels of the town variable based on the median resale price
data$town <- factor(data$town, levels = median_prices$town[order(median_prices$median_price)])
p <- ggplot(data, aes(x = town, y = resale_price, fill = town)) +
geom_boxplot(color = "black", alpha = 0.8) +
scale_fill_viridis_d(name = "town", option = "plasma") +
labs(title = "Boxplot of Resale Price by town (2013-2023) (sorted by median)", x = "town", y = "Resale Price") +
theme(axis.text.x = element_text(size = 6, angle = 90, vjust = 0.5), legend.position = "none")
ggplotly(p)# Create the heat map
p <- ggplot(data, aes(x = town, y = resale_price)) +
geom_bin2d(binwidth = c(5, 10000)) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(
title = "Heatmap of Resale Price by town (2013-2023)",
x = "town",
y = "Resale Price"
) +
theme(axis.text.x = element_text(size = 6, angle = 90, vjust = 0.5))
ggplotly(p)Numerical Variable
Numerical variables will be plotted are floor_area_sqm, lease_commence_date, remaining_lease. I will take floor_area_sqm as an example here.
# Create bins for floor area
data$floor_area_bins <- cut(data$floor_area_sqm, breaks = seq(0, max(data$floor_area_sqm), by = 10))
# Calculate median resale price and count for each floor area bin
treemap_data <- data %>%
group_by(floor_area_bins) %>%
summarise(
median_price = median(resale_price),
count = n()
)
# Create the treemap
treemap(
treemap_data,
index = "floor_area_bins",
vSize = "count",
vColor = "median_price",
type = "value",
title = "Treemap of Median Resale Price by floor_area_sqm (2013-2023)",
title.legend = "Median Resale Price",
bg.labels = "white",
fontsize.labels = 12,
align.labels = list(
c("center", "center"),
c("right", "bottom")
)
)
# Create bins for floor area
data$floor_area_bins <- cut(data$floor_area_sqm, breaks = seq(0, max(data$floor_area_sqm), by = 10))
# Calculate median resale price for each floor area bin
median_prices <- data %>%
group_by(floor_area_bins) %>%
summarise(median_price = median(resale_price))
# Reorder the levels of the floor area bins based on the median resale price
data$floor_area_bins <- factor(data$floor_area_bins, levels = median_prices$floor_area_bins[order(median_prices$median_price)])
# Create the box plot using ggplot2
p <- ggplot(data, aes(x = floor_area_bins, y = resale_price, fill = floor_area_bins)) +
geom_boxplot(color = "black", alpha = 0.8) +
scale_fill_viridis_d(name = "Floor Area (sqm)", option = "plasma") +
labs(
title = "Boxplot of Resale Price by floor_area_sqm (2013-2023) (sorted by median)",
x = "floor_area_sqm",
y = "Resale Price"
) +
theme(axis.text.x = element_text(size = 6, angle = 90, vjust = 0.5), legend.position = "none")
ggplotly(p)# Create the heatmap
p <- ggplot(data, aes(x = floor_area_sqm, y = resale_price)) +
geom_bin2d(binwidth = c(5, 10000)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Count") +
scale_x_continuous(breaks = seq(0, max(data$floor_area_sqm), by = 10)) +
labs(
title = "Heatmap of Resale Price by floor_area_sqm (2013-2023)",
x = "floor_area_sqm",
y = "Resale Price"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "right")
ggplotly(p)2.2.3 UI Design
In the Shiny app, we will design a feature where users can freely select a variable, then visualize bivariate analysis to see how the variable influences resale price.
The demo is as follows:

3. Clustering
3.1 Check Missing Value
# read the data
data <- readRDS("data/resale_hdb.rds")
# check for missing values
missing_values <- colSums(is.na(data))
# print the result
print(missing_values) month town flat_type block
0 0 0 0
street_name storey_range floor_area_sqm flat_model
0 0 0 0
lease_commence_date resale_price remaining_lease
0 0 0
From above output, we can see that there is no missing value in the data set.
3.2 Remove Outliers
# Use boxplot to check outliers
data <- readRDS("data/resale_hdb.rds")
# Get names of numeric columns (assumed to be continuous variables)
continuous_vars <- names(Filter(is.numeric, data))
# Generate boxplot for all continuous variables
boxplot(data[continuous_vars],cex.axis=0.7)
The result shows many outliers in resale_price, now let’s remove them.
# define a function to remove the outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
columns_to_remove_outliers <- c("resale_price")
for (col in columns_to_remove_outliers) {
data[[col]] <- remove_outliers(data[[col]])
}
# check boxplot after removal
boxplot(data[continuous_vars],cex.axis=0.7)
You can see that there are still some outliers in the resale_price. Let’s remove them again.
for (col in columns_to_remove_outliers) {
data[[col]] <- remove_outliers(data[[col]])
}
# check boxplot after removal
boxplot(data[continuous_vars],cex.axis=0.7)
Now we assure that there are no more outliers in any of the continuous variables in the dataset.
3.3 Feature Selection
We will select appropriate variables for ease of clustering analysis and determine whether they need to be encoded in advance, as follows:

From the above, the variables we have selected are: town, flat_type, storey_range, flat_model, floor_area_sqm, lease_commence_date, and remaining_lease. Among these, categorical variables such as town, flat_type, storey_range, and flat_model need to be encoded first.
3.4 Encoding
Since the variables we need to encode have many categories, we can use frequency encoding to replace each category with its frequency of occurrence. This approach helps retain some information about the categories without introducing too many new features.
# Define the frequency encoding function
frequency_encoding <- function(data, variable) {
variable_name <- deparse(substitute(variable))
# Calculate the frequency of each category
freq_table <- data %>%
group_by(!!sym(variable_name)) %>%
summarise(freq = n()) %>%
arrange(freq)
# Create the encoding mapping with class values
encoding_map <- freq_table %>%
mutate(!!paste0(variable_name, "_freq_encoding") := row_number(),
!!paste0(variable_name, "_class") := !!sym(variable_name))
# Merge the encoding mapping with the original data
data <- data %>%
left_join(encoding_map, by = variable_name)
return(data)
}
# Apply frequency encoding to specified variables
data <- frequency_encoding(data, town)
data <- frequency_encoding(data, flat_type)
data <- frequency_encoding(data, storey_range)
data <- frequency_encoding(data, flat_model)
#remove NA.
data <- na.omit(data)
glimpse(data)Rows: 229,338
Columns: 23
$ month <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-0…
$ town <fct> ANG MO KIO, ANG MO KIO, ANG MO KIO, ANG MO …
$ flat_type <fct> 2 ROOM, 2 ROOM, 2 ROOM, 3 ROOM, 3 ROOM, 3 R…
$ block <fct> 510, 314, 323, 170, 174, 445, 607, 535, 449…
$ street_name <fct> ANG MO KIO AVE 8, ANG MO KIO AVE 3, ANG MO …
$ storey_range <fct> 01 TO 03, 01 TO 03, 04 TO 06, 07 TO 09, 01 …
$ floor_area_sqm <dbl> 44, 44, 44, 61, 60, 67, 68, 67, 68, 67, 74,…
$ flat_model <fct> Improved, Improved, Improved, Improved, Imp…
$ lease_commence_date <int> 1980, 1978, 1977, 1986, 1986, 1979, 1980, 1…
$ resale_price <dbl> 253000, 270000, 283000, 305000, 320000, 325…
$ remaining_lease <dbl> 66, 64, 63, 72, 72, 65, 66, 66, 65, 66, 66,…
$ freq.x <int> 10438, 10438, 10438, 10438, 10438, 10438, 1…
$ town_freq_encoding <int> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
$ town_class <fct> ANG MO KIO, ANG MO KIO, ANG MO KIO, ANG MO …
$ freq.y <int> 3585, 3585, 3585, 59644, 59644, 59644, 5964…
$ flat_type_freq_encoding <int> 3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
$ flat_type_class <fct> 2 ROOM, 2 ROOM, 2 ROOM, 3 ROOM, 3 ROOM, 3 R…
$ freq.x.x <int> 43305, 43305, 55755, 50769, 43305, 55755, 5…
$ storey_range_freq_encoding <int> 14, 14, 17, 16, 14, 17, 17, 14, 17, 17, 14,…
$ storey_range_class <fct> 01 TO 03, 01 TO 03, 04 TO 06, 07 TO 09, 01 …
$ freq.y.y <int> 59487, 59487, 59487, 59487, 59487, 33405, 3…
$ flat_model_freq_encoding <int> 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19,…
$ flat_model_class <fct> Improved, Improved, Improved, Improved, Imp…
Now, we have selected the variables to be used for clustering: floor_area_sqm, lease_commence_date, resale_price, remaining_lease, town_freq_encoding, flat_type_freq_encoding, storey_range_freq_encoding, and flat_model_freq_encoding.
3.5 K-means Clustering
3.5.1 Doing K-means Clustering
# doing K-means clustering
kmeans_result <- kmeans(data_cl, centers = 5, nstart = 5) #users can change centers and nstarts in shiny app. here use 5 as an example.
# add result to original dataframe
data$km_cluster <- kmeans_result$cluster
cl_result <- data %>%
select(month, town, flat_type, storey_range, floor_area_sqm, flat_model, lease_commence_date, resale_price, remaining_lease, km_cluster)
datatable(head(cl_result))3.5.2 BIC & AIC
3.5.3 Visualize the Result
Cluster Propotion
# count data points of each cluster
cluster_counts <- data.frame(table(data$km_cluster))
colnames(cluster_counts) <- c("Cluster", "Count")
# create an interactive pie chart to visualize the result
plot_ly(cluster_counts, labels = ~Cluster, values = ~Count,
type = 'pie', textposition = 'inside', textinfo = 'percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = ~paste('Cluster:', Cluster, '<br>Count:', Count),
marker = list(colors = brewer.pal(length(unique(cluster_counts$Cluster)), "Set1"),
line = list(color = '#FFFFFF', width = 1)),
showlegend = FALSE) %>%
layout(title = 'Percentage of Data in Cluster',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))Cluster Charactistic
# Bin continuous variables
data$floor_area_sqm_bin <- cut(data$floor_area_sqm, breaks = seq(0, max(data$floor_area_sqm), by = 10), include.lowest = TRUE, right = FALSE)
data$lease_commence_date_bin <- cut(data$lease_commence_date, breaks = seq(min(data$lease_commence_date), max(data$lease_commence_date), by = 10), include.lowest = TRUE, right = FALSE)
data$remaining_lease_bin <- cut(data$remaining_lease, breaks = seq(0, max(data$remaining_lease), by = 10), include.lowest = TRUE, right = FALSE)
data$resale_price_bin <- cut(data$resale_price,
breaks = seq(min(data$resale_price), max(data$resale_price), by = 100000),
include.lowest = TRUE, right = FALSE)
# Define variables to plot
variables_to_plot <- c("town", "flat_type", "storey_range", "flat_model", "floor_area_sqm_bin", "lease_commence_date_bin", "remaining_lease_bin", "resale_price_bin")
# Create subplot function
create_subplot <- function(variable) {
# Filter out missing values
data_filtered <- data %>%
filter(!is.na(get(variable)))
# Calculate count of each category in each cluster
count_by_cluster_category <- data_filtered %>%
group_by(km_cluster, !!sym(variable)) %>%
summarise(count = n(), .groups = "drop")
# Calculate proportion of each category in each cluster
proportion_by_cluster_category <- count_by_cluster_category %>%
group_by(km_cluster) %>%
mutate(proportion = count / sum(count))
plot_ly(proportion_by_cluster_category,
x = ~factor(km_cluster),
y = ~proportion,
color = as.formula(paste0("~", variable)),
type = "bar",
text = ~paste("<b>", variable, "</b><br>",
format(get(variable), scientific = FALSE), ": ", round(proportion * 100, 2), "%"),
hoverinfo = "text",
name = variable) %>%
layout(title = paste("Proportion of", variable, "in Each Cluster"),
xaxis = list(title = variable),
yaxis = list(title = "Proportion"),
barmode = "stack",
showlegend = TRUE,
hovermode = "x unified",
font = list(size = 8))
}
# Create subplots
subplots <- lapply(variables_to_plot, create_subplot)
# Combine subplots
subplot(subplots, nrows = 1, shareX= TRUE, shareY = TRUE) %>%
layout(title = "Cluster Composition", showlegend = FALSE)Cluster Charactistic Individual
# Calculate the count of each town in each cluster
data_count <- data %>%
group_by(km_cluster, flat_type) %>%
summarise(count = n(), .groups = "drop")
# Calculate the proportion of each town in each cluster
data_prop <- data_count %>%
group_by(km_cluster) %>%
mutate(proportion = count / sum(count))
# Create an interactive stacked bar plot
plot_ly(data_prop, x = ~factor(km_cluster), y = ~proportion, color = ~flat_type,
type = "bar", text = ~paste0(flat_type, ": ", round(proportion * 100, 2), "%"),
hoverinfo = "text") %>%
layout(title = "Proportion of Towns in Each Cluster",
xaxis = list(title = "Cluster"),
yaxis = list(title = "Proportion"),
barmode = "stack",
legend = list(title = list(text = "Town")),
hovermode = "x unified") %>%
config(displayModeBar = TRUE, modeBarButtonsToRemove = c("sendDataToCloud", "lasso2d", "select2d", "autoScale2d"))3.6 UI design
In Shiny app, this part will be designed as follows:



